pacman::p_load(sf, sfdep, tmap, tidyverse, patchwork, DT)
# - Creates a package list containing the necessary R packages
# - Checks if the R packages in the package list have been installed
# - If not installed, will install the missing packages and load them into the R environment.

Take-home Exercise 1: Geospatial Analytics for Public Good
1 Overview
1.1 Background
As urban infrastructures, including public transportation systems like buses, taxis, mass rapid transit, public utilities, and roads, become increasingly digitised, the data generated becomes a valuable resource for tracking the movements of people and vehicles over space and time. This transformation has been facilitated by pervasive computing technologies such as Global Positioning System (GPS) and Radio Frequency Identification (RFID) tags on vehicles. An example of this is the collection of data on bus routes and ridership, amassed from the use of smart cards and GPS devices available on public buses.
The data collected from these sources is likely to contain valuable patterns that offer insights into various aspects of human movement and behavior within a city. Analyzing and comparing these patterns can provide a deeper understanding of urban mobility. Such insights can be instrumental in improving urban management and can also serve as valuable information for both public and private sector stakeholders involved in urban transportation services. This information can aid them in making informed decisions and gaining a competitive edge in their respective domains.
1.2 Objectives
The objective of this study is to utilize appropriate Local Indicators of Spatial Association (LISA) and Emerging Hot Spot Analysis (EHSA) techniques to uncover spatial and spatio-temporal mobility patterns among public bus passengers in Singapore.
This will include the following tasks:
- Geovisualisation and Analysis:
- Compute the passenger trips generated by origin at the hexagon level
- Display the geographical distribution of the passenger trips
- Explore spatial patterns revealed by the geovisualisation
| Peak hour period | Bus tap on time |
|---|---|
| Weekday morning peak | 6am to 9am |
| Weekday afternoon peak | 5pm to 8pm |
| Weekend/holiday morning peak | 11am to 2pm |
| Weekend/holiday evening peak | 4pm to 7pm |
- Local Indicators of Spatial Association (LISA) Analysis:
- Compute LISA of the passenger trips generated by origin
- Display and draw statistical conclusions from the LISA maps
- Emerging Hot Spot Analysis (EHSA)
- Perform Mann-Kendall Test by using the spatio-temporal local Gi* values
- Display EHSA maps of the Gi* values, describe the spatial patterns revealed
2 Loading Packages
The following packages will be used for this exercise:
The code chunk at the top of this page, using the p_load() function of the pacman package, ensures that the sf, sfdep, tmap, tidyverse, patchwork and DT packages are installed and loaded into R.
3 Data Preparation
3.1 The Data
The following data are used for this study:
- Aspatial:
- Passenger Volume by Origin Destination Bus Stops for August, September and October 2023, downloaded from LTA DataMall using API.
- Geospatial
- Bus Stop Location from LTA DataMall. It provides information about all the bus stops currently being serviced by buses, including the bus stop code (identifier) and location coordinates.
- hexagon, a hexagon layer of 250m provided to replace the relatively coarse and irregular Master Plan 2019 Planning Sub-zone GIS data set of URA.
3.2 Import & Preparation
3.2.1 Aspatial
We will import the Passenger Volume by Origin Destination Bus Stops dataset from August to October 2023, downloaded from LTA DataMall, using read_csv() of the readr package.
odbus08 <- read_csv("data/aspatial/origin_destination_bus_202308.csv")
#odbus09 <- read_csv("data/aspatial/origin_destination_bus_202309.csv")
#odbus10 <- read_csv("data/aspatial/origin_destination_bus_202310.csv")

(a) Attributes
glimpse() of the dplyr package allows us to learn more about the attribute information.
glimpse(odbus08)

Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <chr> "04168", "04168", "80119", "80119", "44069", "4406…
$ DESTINATION_PT_CODE <chr> "10051", "10051", "90079", "90079", "17229", "1722…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
#glimpse(odbus09)
#glimpse(odbus10)

Insights:
- There are 7 variables in the odbus08 tibble data, they are:
- YEAR_MONTH: Month in which data is collected
- DAY_TYPE: Weekdays or weekends/holidays
- TIME_PER_HOUR: Hour which the passenger trip is based on, in intervals from 0 to 23 hours
- PT_TYPE: Type of public transport, i.e. bus
- ORIGIN_PT_CODE: Origin bus stop ID
- DESTINATION_PT_CODE: Destination bus stop ID
- TOTAL_TRIPS: Number of trips
- We also note that the values in ORIGIN_PT_CODE and DESTINATION_PT_CODE are stored as character data type rather than as categorical (factor) data.
(b) Unique Bus Stops
n_distinct() of the dplyr package allows us to count the unique bus stops in the data set.
n_distinct(odbus08$ORIGIN_PT_CODE)

[1] 5067
The results reveal that there are 5067 distinct origin bus stops.
(c) Distributions of Trips
The summary() function allows us to take a quick look at the distribution of the passenger trips.
summary(odbus08$TOTAL_TRIPS)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.00 2.00 4.00 21.04 13.00 35049.00
Insights:
- The average passenger volume is 21, the minimum is 1, and the maximum is 35049
- 75% of records have 13 or fewer passenger trips
max <- odbus08 %>% filter(TOTAL_TRIPS==35049)
max

# A tibble: 1 × 7
YEAR_MONTH DAY_TYPE TIME_PER_HOUR PT_TYPE ORIGIN_PT_CODE DESTINATION_PT_CODE
<chr> <chr> <dbl> <chr> <chr> <chr>
1 2023-08 WEEKDAY 18 BUS 46101 46211
# ℹ 1 more variable: TOTAL_TRIPS <dbl>
Further investigation shows that the maximum passenger volume occurred on a weekday between 6pm and 7pm, from origin bus stop 46101 to destination bus stop 46211.
(a) Convert Data Type
as.factor() can be used to convert the variables ORIGIN_PT_CODE and DESTINATION_PT_CODE from character to factor (categorical) data type. We use glimpse() again to check the results.
odbus08$ORIGIN_PT_CODE <- as.factor(odbus08$ORIGIN_PT_CODE)
odbus08$DESTINATION_PT_CODE <- as.factor(odbus08$DESTINATION_PT_CODE)
glimpse(odbus08)

Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <fct> 04168, 04168, 80119, 80119, 44069, 44069, 20281, 2…
$ DESTINATION_PT_CODE <fct> 10051, 10051, 90079, 90079, 17229, 17229, 20141, 2…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
Note that both variables are now of factor data type.
(b) Duplicates Check
Before moving on to the next step, it is a good practice for us to check for duplicated records to prevent double counting of passenger trips.
duplicate <- odbus08 %>%
group_by_all() %>%
filter(n()>1) %>%
ungroup()
duplicate

# A tibble: 0 × 7
# ℹ 7 variables: YEAR_MONTH <chr>, DAY_TYPE <chr>, TIME_PER_HOUR <dbl>,
# PT_TYPE <chr>, ORIGIN_PT_CODE <fct>, DESTINATION_PT_CODE <fct>,
# TOTAL_TRIPS <dbl>
Results confirm that there are no duplicated records found.
(c) Extracting the Study Data
In our study, we would like to examine patterns for the 4 peak hour periods. Therefore, we create a new variable period using nested ifelse() calls that state which peak period, if any, an observation falls into, using the code chunk below.
peak <- odbus08 %>%
# Weekday morning peak
mutate(period= ifelse(DAY_TYPE=="WEEKDAY" & (TIME_PER_HOUR >= 6 & TIME_PER_HOUR <= 9), "WDM",
# Weekday afternoon peak
ifelse(DAY_TYPE=="WEEKDAY" & (TIME_PER_HOUR >= 17 & TIME_PER_HOUR <= 20), "WDA",
# Weekend/holiday morning peak
ifelse(DAY_TYPE=="WEEKENDS/HOLIDAY" & (TIME_PER_HOUR >= 11 & TIME_PER_HOUR <= 14), "WEM",
# Weekend/holiday evening peak
ifelse(DAY_TYPE=="WEEKENDS/HOLIDAY" & (TIME_PER_HOUR >= 16 & TIME_PER_HOUR <= 19), "WEE",
# Return off-peak if neither of the peak hour periods
"Off-peak")))))

We can then filter for peak-period data using the newly created period column and aggregate the total trips for each origin bus stop during peak periods.
peakperiods <- peak %>%
# filter helps to keep records that occurred during period periods
filter(period !="Off-peak") %>%
# aggregate the total passenger trips for each origin bus stop
group_by(period, ORIGIN_PT_CODE) %>%
summarise(TRIPS=sum(TOTAL_TRIPS))

Let's visualise the total passenger volume for each peak period.
ggplot(data = peakperiods,
       aes(x = period, y = TRIPS)) +
  geom_bar(stat = "identity",
           color = "black",
           fill = "lightblue") +
  theme(legend.position = "none") +
  labs(title = "Frequency of Trips for each Peak Period",
       x = "Peak Period",
       y = "Frequency")
3.2.2 Geospatial
(a) Bus Stop Shapefile
In this section, we import BusStop shapefile into RStudio using st_read() function of sf package. This data provides the locations of all bus stops as at Q2 of 2023. crs = 3414 ensures coordinate reference system (CRS) is 3414, which is the EPSG code for the SVY21 projection used in Singapore.
busstop <- st_read(dsn = "data/geospatial",
layer = "BusStop") %>%
st_transform(crs = 3414)

Reading layer `BusStop' from data source
`C:\kytjy\ISSS624\Take-Home_Ex\Take-Home_Ex1\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
The imported shapefile is a simple feature object of the sf package. From the output, we can see that there are 5161 points with 3 fields, and confirm that the projected CRS is SVY21 as expected.
mapview::mapview(busstop)

(b) Hexagon Layer
A hexagonal grid is used to replace the relatively coarse and irregular Master Plan 2019 Planning Sub-zone GIS data set of URA. Hexagons have a number of advantages over square grids and irregular polygons:
- The distance between the centroid of a hexagon to all neighboring centroids is the same in all directions.
- The lack of acute angles in a regular hexagon means that no areas of the shape are outliers in any direction.
- All neighboring hexagons have the same spatial relationship with the central hexagon, making spatial querying and joining a more straightforward process.
- Unlike square-based grids, the geometry of hexagons is well-suited to representing the curves of geographic features, such as rivers and roads, which are rarely perpendicular in shape.
- The “softer” shape of a hexagon compared to a square means it performs better at representing gradual spatial changes.
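The first property (equal centroid spacing) can be verified with a small self-contained sketch; the toy extent, cellsize and cell index below are illustrative only:

```r
library(sf)

# Toy extent: a 1 km x 1 km square (coordinates in metres, no CRS set)
box <- st_sfc(st_polygon(list(rbind(c(0, 0), c(1000, 0),
                                    c(1000, 1000), c(0, 1000), c(0, 0)))))

# Hexagonal grid; cellsize is the distance between opposite edges
hexes <- st_make_grid(box, cellsize = 200, square = FALSE)
cents <- st_centroid(hexes)

# Distances from one cell's centroid to the centroids of all touching cells
nb <- st_intersects(hexes[13], hexes)[[1]]
st_distance(cents[nb], cents[13])
# Every neighbouring centroid sits exactly one cellsize (200 m) away;
# the single 0 is the distance from the chosen cell to itself.
```

Because every pair of touching hexagons in a hexagonal tessellation shares an edge, all neighbour distances equal the cellsize; a square grid would mix edge and diagonal neighbours at two different distances.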
Step 1: Create Hexagonal Grids
We first create a hexagonal grid layer with an apothem of 250m (the perpendicular distance between the centre of a hexagon and its edges) using st_make_grid, and convert the grid into an sf object with st_sf, using the code below.
The st_make_grid function creates a grid over a spatial object. It takes 4 arguments:
- x: sf object; the input spatial data
- cellsize: for hexagonal cells, the distance between opposite edges, in the units of the CRS the spatial data uses. In this case, cellsize = 250m * 2 = 500m
- what: character; one of "polygons", "corners", or "centers"
- square: indicates whether to create a square grid (TRUE) or a hexagon grid (FALSE)
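The cellsize arithmetic can be double-checked: a regular hexagon with apothem a has cellsize 2a and area 2√3·a². A throwaway sketch (toy extent, no CRS, names introduced here for illustration):

```r
library(sf)

a <- 250                 # target apothem in metres
cellsize <- 2 * a        # 500, the value passed to st_make_grid()
2 * sqrt(3) * a^2        # expected area per cell: ~216506 m^2

# Verify against an actual grid cell
box <- st_sfc(st_polygon(list(rbind(c(0, 0), c(2000, 0),
                                    c(2000, 2000), c(0, 2000), c(0, 0)))))
hex <- st_make_grid(box, cellsize = cellsize, square = FALSE)
st_area(hex[1])          # matches the formula above (units are coordinate units squared)
```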
area_hexagon_grid = st_make_grid(busstop, 500, what = "polygons", square = FALSE)
# Converts grid to sf
hexagon_grid_sf = st_sf(area_hexagon_grid) %>%
# Assign unique ID to each grid
mutate(grid_id = 1:length(lengths(area_hexagon_grid)))

Step 2: Remove grids with no bus stops
Secondly, we count the number of points in each grid and keep grids with bus stops using the code chunks below.
# Create a column containing the count of bus stops in each grid
hexagon_grid_sf$busstops = lengths(st_intersects(hexagon_grid_sf, busstop))
# Remove grids with no bus stop inside, i.e. only keep hexagons with bus stops
hexagon_w_busstops = filter(hexagon_grid_sf, busstops > 0)

Before we proceed, let’s confirm that all bus stops have been accounted for in our hexagon layer.
sum(hexagon_w_busstops$busstops)

[1] 5161
This is in line with the 5161 points of the busstop shapefile.
Lastly, using tm_shape of tmap, we can quickly visualise the results of the hexagon grids we have created.
#qtm(hexagon_w_busstops, fill="busstops")
tmap_mode("view")
hex = tm_shape(hexagon_w_busstops)+
tm_fill(
col = "busstops",
palette = "Blues",
style = "cont",
title = "Number of Bus Stops",
id = "grid_id",
showNA = FALSE,
alpha = 0.6,
popup.format = list(
grid_id = list(format = "f", digits = 0)
)
)+
tm_borders(col = "grey40", lwd = 0.7)
hex

(a) Combining Busstop and hexagon layer
Code chunk below populates the grid ID (i.e. grid_id) of hexagon_w_busstops sf data frame into busstop sf data frame.
bs_wgrids <- st_intersection(busstop, hexagon_w_busstops) %>%
select(BUS_STOP_N,BUS_ROOF_N,LOC_DESC, grid_id, busstops) %>%
  st_drop_geometry()

st_intersection() is used to perform a point-in-polygon overlay, and the output is a point sf object. select() of the dplyr package is then used to retain only BUS_STOP_N, BUS_ROOF_N, LOC_DESC, grid_id and busstops in the bs_wgrids data frame. Five bus stops are excluded from the resultant data frame because they are outside the Singapore boundary.
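As a side note, a similar result could be sketched with st_join(), which keeps the point geometry until it is dropped explicitly; this is an alternative, not the approach used above:

```r
# Alternative sketch: attribute join via st_join() instead of st_intersection()
bs_wgrids_alt <- st_join(busstop, hexagon_w_busstops,
                         join = st_intersects) %>%
  select(BUS_STOP_N, BUS_ROOF_N, LOC_DESC, grid_id, busstops) %>%
  st_drop_geometry()
```

Note that st_join() performs a left join by default, so bus stops falling outside every hexagon would be retained with an NA grid_id rather than being dropped.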
Before we proceed, let’s perform a duplicates check on bs_wgrids.
duplicate2 <- bs_wgrids %>%
group_by_all() %>%
filter(n()>1) %>%
ungroup()
duplicate2

# A tibble: 8 × 5
BUS_STOP_N BUS_ROOF_N LOC_DESC grid_id busstops
<chr> <chr> <chr> <int> <int>
1 43709 B06 BLK 644 1904 7
2 43709 B06 BLK 644 1904 7
3 58031 UNK OPP CANBERRA DR 2939 7
4 58031 UNK OPP CANBERRA DR 2939 7
5 51071 B21 MACRITCHIE RESERVOIR 3081 6
6 51071 B21 MACRITCHIE RESERVOIR 3081 6
7 97079 B14 OPP ST. JOHN'S CRES 5037 5
8 97079 B14 OPP ST. JOHN'S CRES 5037 5
The results display 8 rows, corresponding to 4 distinct records that each appear twice. We remove the extra copies to prevent double-counting.
The code chunk below helps retain unique records.
bs_wgrids <- unique(bs_wgrids)

(c) Append grid ID to peakperiods
Transposing each peak period into a column using pivot_wider() of the tidyr package allows us to create further variables at the bus stop level. We replace NA values with 0 to reflect that there is no traffic for those periods.
peakperiods_wide <- pivot_wider(peakperiods,
names_from = "period",
values_from = "TRIPS")
peakperiods_wide["WDA"][is.na(peakperiods_wide["WDA"])] <- 0
peakperiods_wide["WDM"][is.na(peakperiods_wide["WDM"])] <- 0
peakperiods_wide["WEE"][is.na(peakperiods_wide["WEE"])] <- 0
peakperiods_wide["WEM"][is.na(peakperiods_wide["WEM"])] <- 0
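As an aside, the four replacement statements above can likely be folded into the pivot itself with pivot_wider()'s values_fill argument; a sketch:

```r
# Sketch: fill missing period/stop combinations with 0 at pivot time
peakperiods_wide <- pivot_wider(peakperiods,
                                names_from = "period",
                                values_from = "TRIPS",
                                values_fill = 0)
```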
#length(unique(peakperiods$ORIGIN_PT_CODE))
glimpse(peakperiods_wide)

Rows: 5,067
Columns: 5
$ ORIGIN_PT_CODE <fct> 01012, 01013, 01019, 01029, 01039, 01059, 01109, 01112,…
$ WDA <dbl> 8448, 7328, 3608, 9317, 12937, 2133, 322, 45010, 27233,…
$ WDM <dbl> 1973, 952, 1789, 2561, 2938, 1651, 161, 8492, 9015, 424…
$ WEE <dbl> 3208, 2796, 1623, 4244, 7403, 1190, 88, 21706, 11927, 6…
$ WEM <dbl> 2273, 1697, 1511, 3272, 5424, 1062, 89, 14964, 8278, 61…
Notice that there are 5067 origin bus stops in the peakperiods table, compared with the 5161 bus stops in LTA's BusStop shapefile. This could be due to a timing difference: LTA's BusStop shapefile is as of July 2023, while peakperiods is based on August 2023 data. Some bus stops may also simply have recorded no tap-ins during the peak periods.
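If we wanted to see exactly which stops make up the gap, a quick sketch (illustrative; missing_stops is a name introduced here) could compare the two sets of bus stop codes:

```r
# Bus stop codes present in the shapefile but never observed as an origin
missing_stops <- setdiff(busstop$BUS_STOP_N,
                         as.character(peakperiods_wide$ORIGIN_PT_CODE))
length(missing_stops)
```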
We can now append the grid ID from bs_wgrids data frame onto peakperiods_wide data frame.
origin_grid <- left_join(peakperiods_wide, bs_wgrids,
by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>%
rename(ORIGIN_BS = ORIGIN_PT_CODE) %>%
group_by(grid_id)
# summarise(WDA = sum(WDA), WDM = sum(WDM), WEE = sum(WEE), WEM = sum(WEM))
glimpse(origin_grid)

Rows: 5,079
Columns: 9
Groups: grid_id [1,506]
$ ORIGIN_BS <chr> "01012", "01013", "01019", "01029", "01039", "01059", "0110…
$ WDA <dbl> 8448, 7328, 3608, 9317, 12937, 2133, 322, 45010, 27233, 932…
$ WDM <dbl> 1973, 952, 1789, 2561, 2938, 1651, 161, 8492, 9015, 4240, 5…
$ WEE <dbl> 3208, 2796, 1623, 4244, 7403, 1190, 88, 21706, 11927, 6221,…
$ WEM <dbl> 2273, 1697, 1511, 3272, 5424, 1062, 89, 14964, 8278, 6198, …
$ BUS_ROOF_N <chr> "B03", "B05", "B04", "B07", "B09", "B08", "TMNL", "B07", "B…
$ LOC_DESC <chr> "HOTEL GRAND PACIFIC", "ST JOSEPH'S CH", "BRAS BASAH CPLX",…
$ grid_id <int> 3292, 3292, 3292, 3323, 3354, 3324, 3324, 3292, 3324, 3292,…
$ busstops <int> 8, 8, 8, 7, 8, 7, 7, 8, 7, 8, 7, 7, 8, 7, 7, 7, 7, 7, 7, 7,…
(d) Duplicates Check
Before continuing, it is good practice to check for duplicated records.
duplicate3 <- origin_grid %>%
group_by_all() %>%
filter(n()>1) %>%
ungroup()
duplicate3

# A tibble: 0 × 9
# ℹ 9 variables: ORIGIN_BS <chr>, WDA <dbl>, WDM <dbl>, WEE <dbl>, WEM <dbl>,
# BUS_ROOF_N <chr>, LOC_DESC <chr>, grid_id <int>, busstops <int>
The results show that there are no duplicates found.
(e) Variable Creation for Geospatial Analysis
We create a totaltrips variable that sums the passenger volumes across the four peak periods.
origin_grid <- origin_grid %>%
  mutate(totaltrips = WDA + WDM + WEE + WEM)

(f) Retrieve Geometry
Since the geometry was dropped earlier, we join origin_grid back to hexagon_w_busstops by grid_id and convert the result to an sf object.
origin_gridwgeom <- inner_join(origin_grid,
hexagon_w_busstops,
by = "grid_id")
origin_gridwgeom <- st_as_sf(origin_gridwgeom)

4 Geovisualisation & Analysis
tmap_mode("view")
tm_shape(origin_gridwgeom)+
tm_fill("WDA",
style = "quantile",
palette = "Blues",
title = "Passenger trips") +
tm_layout(main.title = "Passenger Volume during Weekday Afternoon Peak Hour",
main.title.position = "center",
main.title.size = 1.2,
legend.height = 0.45,
legend.width = 0.35,
frame = TRUE) +
tm_borders(alpha = 0.5) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: Passenger trips data from LTA",
position = c("left", "bottom"))
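A possible extension (a sketch, not part of the analysis above) is to compare all four peak periods side by side; passing several column names to tm_fill() produces small multiples in tmap:

```r
# Sketch: small multiples of the four peak-period passenger volumes
tmap_mode("plot")
tm_shape(origin_gridwgeom) +
  tm_fill(c("WDM", "WDA", "WEM", "WEE"),
          style = "quantile",
          palette = "Blues",
          title = "Passenger trips") +
  tm_borders(alpha = 0.5) +
  tm_facets(free.scales = TRUE)
```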